4  Re-weighting to a target population

Author

Janick Weberpals, RPh, PhD

Published

September 10, 2024

This is a reproducible example on how to incorporate population weights to match distributions of a target population in multiple imputation > matching/weighting > balance assessment > outcome analysis workflows.

Load packages:

Show the code
library(here)
library(dplyr)
library(survival)
library(mice)
library(MatchThem)
library(MatchIt)
library(survey)
library(gtsummary)

# calling functions
source(here::here("functions", "source_encore.io_functions.R"))

# track time
runtime <- tictoc::tic()

4.1 Data generation

We use the simulate_flaura() function to simulate a realistic oncology comparative effectiveness cohort analytic dataset.

Show the code
# load example dataset with missing observations
data_miss <- simulate_flaura(
  n_total = 3500, 
  treat_prevalence = .51, 
  seed = 41, 
  include_id = FALSE, 
  imposeNA = TRUE
  ) |> 
  # we have to convert sex and ecog into a factor variable 
  # since anesrake doesn't accept 0/1 numeric encoding for 
  # binary variables
  mutate(across(c(dem_sex_cont, c_ecog_cont), function(x) factor(as.character(x))))

covariates <- data_miss |> 
  select(starts_with("c_"), starts_with("dem_")) |> 
  colnames()

4.2 Multiple imputation

Multiple imputation using mice:

Show the code
# impute data
data_imp <- futuremice(
  parallelseed = 42,
  n.core = 7,
  data = data_miss,
  method = "rf",
  m = 10,
  print = FALSE
  )
Warning in check.cores(n.core, available, m): 'n.core' exceeds the maximum
number of available cores on your machine or the number of imputations, and is
set to 3

4.3 Defining target distributions

Before applying the re-weighting, we need to define the target distributions of patient characteristics that we want to match from the clinical trial using the raking procedure. The following distributions are taken from Table 1 of the FLAURA trial.

FLAURA trial Table 1; in OS analysis race was simplified to Asian vs. non-Asian
Show the code
# Define FLAURA distributions for key covariates --------------------------
# order is as in Table 1

## sex ---------------------------------------------------------------------

# female (0) to male (1) proportion:
sex_target <- c(.63, .37) 
names(sex_target) <- c("0", "1")

## race --------------------------------------------------------------------
# asian, non-asian
# asian (TRUE) to non-asian (FALSE) proportion
# note: logical variables in dataframe can be matched to a numeric vector of length 2 and ordered with the TRUE target as the first element and the FALSE target as the second element.
race_target <- c(.62, .38)

## smoking -----------------------------------------------------------------

# current/former smoker (TRUE) to never smoker (FALSE) proportion
# note: logical variables in dataframe can be matched to a numeric vector of length 2 and ordered with the TRUE target as the first element and the FALSE target as the second element.
smoker_target <- c(.35, .65)

## ecog --------------------------------------------------------------------

# ecog 0 by exposure 
avg_prop_ecog0 <- .41

# ecog 0 to ecog 1 proportion
ecog_target <- c(.41, .59)
names(ecog_target) <- c("0", "1")


# summarize target distributions in a named list vector --------------
targets <- list(sex_target, race_target, smoker_target, ecog_target)
names(targets) <- c("dem_sex_cont", "dem_race", "c_smoking_history", "c_ecog_cont")

# print
targets
$dem_sex_cont
   0    1 
0.63 0.37 

$dem_race
[1] 0.62 0.38

$c_smoking_history
[1] 0.35 0.65

$c_ecog_cont
   0    1 
0.41 0.59 

4.4 Propensity score matching and re-weighting

In this step, propensity score matching and re-weighting of key patient characteristics to match those of the original RCT is performed across all imputed datasets.

The propensity score model is specified as follows:

Show the code
# apply propensity score matching on mids object
ps_form <- as.formula(paste("treat ~", paste(covariates, collapse = " + ")))
ps_form
treat ~ c_smoking_history + c_number_met_sites + c_ecog_cont + 
    c_stage_initial_dx_cont + c_hemoglobin_g_dl_cont + c_urea_nitrogen_mg_dl_cont + 
    c_platelets_10_9_l_cont + c_calcium_mg_dl_cont + c_glucose_mg_dl_cont + 
    c_lymphocyte_leukocyte_ratio_cont + c_alp_u_l_cont + c_protein_g_l_cont + 
    c_alt_u_l_cont + c_albumin_g_l_cont + c_bilirubin_mg_dl_cont + 
    c_chloride_mmol_l_cont + c_monocytes_10_9_l_cont + c_eosinophils_leukocytes_ratio_cont + 
    c_ldh_u_l_cont + c_hr_cont + c_sbp_cont + c_oxygen_cont + 
    c_neutrophil_lymphocyte_ratio_cont + c_bmi_cont + c_ast_alt_ratio_cont + 
    c_time_dx_to_index + c_de_novo_mets_dx + c_height_cont + 
    c_weight_cont + c_dbp_cont + c_year_index + dem_age_index_cont + 
    dem_sex_cont + dem_race + dem_region + dem_ses

The matching and re-weighting is performed using the re_weight() function. This function is a wrapper for matchit() and weightit() in combination with the anesrake() function which performs the raking (= re-weighting) procedure.

We apply this function to each imputed dataset. Before doing so, the imputed datasets, which are currently stored as a mids object, needs to be converted to a list of dataframes:

Show the code
# create a mild object containing lists of data.frames
data_mild <- mice::complete(data = data_imp, action = "all", include = FALSE)

summary(data_mild)
   Length Class      Mode
1  39     data.frame list
2  39     data.frame list
3  39     data.frame list
4  39     data.frame list
5  39     data.frame list
6  39     data.frame list
7  39     data.frame list
8  39     data.frame list
9  39     data.frame list
10 39     data.frame list

The lapply function loops the function through each dataframe and returns a list of matchit objects which contain imputed > matched > re-weighted datasets. To take advantage of the features that come with the cobalt and matchthem packages, the function stores the raking weights as sampling weights (s.weights).

Show the code
# call match re-weight
matchit_out_list <- lapply(
  # list of dataframes
  X = data_mild, 
  # call function
  FUN = re_weight,
  # target distributions
  targets = targets,
  # should matching or weighting be performed
  matching_weighting = "matching",
  # matching arguments passed on to matchit() function
  formula = ps_form,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "linear.logit",
  caliper = 0.05,
  replace = F
  )
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 8 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 13 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 13 iterations"
[1] "Raking converged in 13 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 11 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 11 iterations"

We can inspect the output of the first imputed > matched > re-weighted dataset.

Show the code
matchit_out_list[[1]]
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression and linearized
             - sampling weights not included in estimation
 - caliper: <distance> (0.175)
 - number of obs.: 3500 (original), 380 (matched)
 - sampling weights: present
 - target estimand: ATT
 - covariates: c_smoking_history, c_number_met_sites, c_ecog_cont, c_stage_initial_dx_cont, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_time_dx_to_index, c_de_novo_mets_dx, c_height_cont, c_weight_cont, c_dbp_cont, c_year_index, dem_age_index_cont, dem_sex_cont, dem_race, dem_region, dem_ses

4.5 Table 1

To check if the re-weighting process worked, we can extract the matched patients and compare a Table 1 that does not include the weights vs. a Table that considers the weights. For this example, we look at the first imputed > matched > re-weighted dataset.

Show the code
# extract the matched of
first_dataset <- get_matches(
  object = matchit_out_list[[1]]
  )

Reminder : The target distributions look like this

Show the code
targets
$dem_sex_cont
   0    1 
0.63 0.37 

$dem_race
[1] 0.62 0.38

$c_smoking_history
[1] 0.35 0.65

$c_ecog_cont
   0    1 
0.41 0.59 
Show the code
library(cardx)
library(smd)

# print
first_dataset |>
  tbl_summary(
    by = treat,
    include = names(targets)
    ) |> 
  add_difference(test = dplyr::everything() ~ "smd") |>
  add_overall() |>
  modify_column_hide(columns = "conf.low") |> 
  modify_header(
    label ~ "**Patient characteristic**",
    stat_0 ~ "**Total** <br> N = {round(N, 2)}",
    stat_1 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)",
    stat_2 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)"
    ) |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment received**")

Patient characteristic

Total
N = 380

1

Treatment received

Difference

2

0
N = 190
(50.0%)

1

1
N = 190
(50.0%)

1
dem_sex_cont


0.03
    0 253 (67%) 128 (67%) 125 (66%)
    1 127 (33%) 62 (33%) 65 (34%)
dem_race 226 (59%) 106 (56%) 120 (63%) -0.15
c_smoking_history 185 (49%) 94 (49%) 91 (48%) 0.03
c_ecog_cont


0.01
    0 175 (46%) 88 (46%) 87 (46%)
    1 205 (54%) 102 (54%) 103 (54%)
1

n (%)

2

Standardized Mean Difference

Show the code
# create survey object 
data_svy <- svydesign(ids = ~ 1, weights = ~ weights, data = first_dataset)

# print
data_svy |>
  tbl_svysummary(
    by = treat,
    include = names(targets)
    ) |> 
  add_difference(test = dplyr::everything() ~ "smd") |>
  add_overall() |>
  modify_column_hide(columns = "conf.low") |> 
  modify_header(
    label ~ "**Patient characteristic**",
    stat_0 ~ "**Total** <br> N = {round(N, 2)}",
    stat_1 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)",
    stat_2 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)"
    ) |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment received**")

Patient characteristic

Total
N = 380

1

Treatment received

Difference

2

0
N = 187.36
(49.3%)

1

1
N = 192.64
(50.7%)

1
dem_sex_cont


0.03
    0 239 (63%) 119 (64%) 120 (62%)
    1 141 (37%) 68 (36%) 73 (38%)
dem_race 236 (62%) 109 (58%) 127 (66%) -0.16
c_smoking_history 133 (35%) 68 (36%) 65 (34%) 0.06
c_ecog_cont


0.06
    0 156 (41%) 80 (43%) 76 (40%)
    1 224 (59%) 108 (57%) 117 (60%)
1

n (%)

2

Standardized Mean Difference

4.5.1 Comparison of custom function vs. matchthem()

Lastly, we want to make sure that our custom function results in the same matched datasets as the matchthem() function which we use in the main analysis - not considering the re-weighting.

For this demonstration, we use the same matching parameters, but without re-weighting after matching in our custom function.

4.5.2 Custom function

We run again our custom function but with targets = NULL to not re-weight any of the included covariates. To convert the returned output of a list of matchit objects into an object of type mimids we use the MatchThem::as.mimids() function.

Show the code
# call match re-weight
set.seed(42)
matchit_out_list <- lapply(
  X = data_mild, 
  FUN = re_weight,
  targets = NULL,
  matching_weighting = "matching",
  formula = ps_form,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "logit",
  caliper = 0.05,
  replace = F
  )
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
Show the code
# convert the output into a mimids object
data_mimids_from_function <- MatchThem::as.mimids(
  x = matchit_out_list, 
  datasets = data_imp
  )

data_mimids_from_function
Printing               | dataset: #1
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression
 - caliper: <distance> (0.022)
 - number of obs.: 3500 (original), 376 (matched)
 - target estimand: ATT
 - covariates: c_smoking_history, c_number_met_sites, c_ecog_cont, c_stage_initial_dx_cont, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_time_dx_to_index, c_de_novo_mets_dx, c_height_cont, c_weight_cont, c_dbp_cont, c_year_index, dem_age_index_cont, dem_sex_cont, dem_race, dem_region, dem_ses

4.5.3 matchthem() function

The following code resembles the code we would use in the main analysis by implementing the generic matchthem() function.

Show the code
# matching
set.seed(42)
data_mimids <- matchthem(
  datasets = data_imp,
  formula = ps_form,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "logit",
  caliper = 0.05,
  replace = F
  )

Matching Observations  | dataset: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
Show the code
data_mimids
Printing               | dataset: #1
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression
 - caliper: <distance> (0.022)
 - number of obs.: 3500 (original), 376 (matched)
 - target estimand: ATT
 - covariates: c_smoking_history, c_number_met_sites, c_ecog_cont, c_stage_initial_dx_cont, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_time_dx_to_index, c_de_novo_mets_dx, c_height_cont, c_weight_cont, c_dbp_cont, c_year_index, dem_age_index_cont, dem_sex_cont, dem_race, dem_region, dem_ses

4.5.4 Comparison of stacked datasets

We can now stack the datasets (= vertically append them) and compare the resulting 10 x 10 datasets for any differences:

Show the code
waldo::compare(
  MatchThem::complete(data_mimids_from_function), 
  MatchThem::complete(data_mimids)
  )
✔ No differences

4.6 Session info

Script runtime: 1.05 minutes.

Show the code
pander::pander(subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion)))
  package loadedversion
cardx cardx 0.2.0
dplyr dplyr 1.1.4
gtsummary gtsummary 2.0.1
here here 1.0.1
MatchIt MatchIt 4.5.5
MatchThem MatchThem 1.2.1
Matrix Matrix 1.7-0
mice mice 3.16.0
smd smd 0.7.0
survey survey 4.4-2
survival survival 3.5-8
Show the code
pander::pander(sessionInfo())

R version 4.4.0 (2024-04-24)

Platform: x86_64-pc-linux-gnu

locale: LC_CTYPE=C.UTF-8, LC_NUMERIC=C, LC_TIME=C.UTF-8, LC_COLLATE=C.UTF-8, LC_MONETARY=C.UTF-8, LC_MESSAGES=C.UTF-8, LC_PAPER=C.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=C.UTF-8 and LC_IDENTIFICATION=C

attached base packages: grid, stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: smd(v.0.7.0), cardx(v.0.2.0), gtsummary(v.2.0.1), survey(v.4.4-2), Matrix(v.1.7-0), MatchIt(v.4.5.5), MatchThem(v.1.2.1), mice(v.3.16.0), survival(v.3.5-8), dplyr(v.1.1.4) and here(v.1.0.1)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), fastmap(v.1.2.0), digest(v.0.6.37), rpart(v.4.1.23), lifecycle(v.1.0.4), cluster(v.2.1.6), waldo(v.0.5.3), gdata(v.3.0.0), magrittr(v.2.0.3), compiler(v.4.4.0), sass(v.0.4.9), rlang(v.1.1.4), Hmisc(v.5.1-3), tools(v.4.4.0), gt(v.0.11.0), utf8(v.1.2.4), yaml(v.2.3.10), data.table(v.1.16.0), knitr(v.1.48), htmlwidgets(v.1.6.4), xml2(v.1.3.6), withr(v.3.0.1), foreign(v.0.8-86), purrr(v.1.0.2), nnet(v.7.3-19), fansi(v.1.0.6), jomo(v.2.7-6), colorspace(v.2.1-1), future(v.1.34.0), ggplot2(v.3.5.1), gtools(v.3.9.5), globals(v.0.16.3), scales(v.1.3.0), iterators(v.1.0.14), MASS(v.7.3-60.2), cli(v.3.6.3), rmarkdown(v.2.28), crayon(v.1.5.3), generics(v.0.1.3), rstudioapi(v.0.16.0), sessioninfo(v.1.2.2), commonmark(v.1.9.1), minqa(v.1.2.8), DBI(v.1.2.3), pander(v.0.6.5), stringr(v.1.5.1), splines(v.4.4.0), assertthat(v.0.2.1), parallel(v.4.4.0), base64enc(v.0.1-3), mitools(v.2.4), vctrs(v.0.6.5), WeightIt(v.1.3.0), boot(v.1.3-30), glmnet(v.4.1-8), jsonlite(v.1.8.8), mitml(v.0.4-5), Formula(v.1.2-5), htmlTable(v.2.4.3), listenv(v.0.9.1), weights(v.1.0.4), foreach(v.1.5.2), tidyr(v.1.3.1), glue(v.1.7.0), parallelly(v.1.38.0), nloptr(v.2.1.1), pan(v.1.9), chk(v.0.9.2), codetools(v.0.2-20), stringi(v.1.8.4), shape(v.1.4.6.1), gtable(v.0.3.5), lme4(v.1.1-35.5), munsell(v.0.5.1), tibble(v.3.2.1), anesrake(v.0.80), pillar(v.1.9.0), furrr(v.0.3.1), htmltools(v.0.5.8.1), R6(v.2.5.1), rprojroot(v.2.0.4), evaluate(v.0.24.0), lattice(v.0.22-6), markdown(v.1.13), cards(v.0.2.1), backports(v.1.5.0), tictoc(v.1.2.1), broom(v.1.0.6), renv(v.1.0.7), simsurv(v.1.0.0), Rcpp(v.1.0.13), gridExtra(v.2.3), nlme(v.3.1-164), checkmate(v.2.3.2), xfun(v.0.47) and pkgconfig(v.2.0.3)

Show the code
pander::pander(options('repos'))
  • repos:

    REPO_NAME
    https://packagemanager.posit.co/cran/linux/noble/latest

4.7